This report explores a dataset containing the quality ratings and attributes of 1,599 red wines, described by 13 variables. My objective is to find out which variables influence the quality of red wine.
First, I run a few basic functions to get an overview of the data.
## [1] 1599 13
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
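The overview above is the kind of output produced by dim, names, summary and str. The following is a minimal sketch of those calls, run on a four-row stand-in built from the first values shown in the str output. The file name in the comment is an assumption, since the report does not show how the data was loaded.

```r
# Stand-in data frame built from the first four rows shown by str() above.
# In the real analysis `df` would come from something like
# read.csv("wineQualityReds.csv"); that file name is an assumption.
df <- data.frame(
  X = 1:4,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2),
  alcohol = c(9.4, 9.8, 9.8, 9.8),
  quality = c(5L, 5L, 5L, 6L)
)

dims <- dim(df)      # number of rows and columns
vars <- names(df)    # variable names
print(dims)
print(vars)
print(summary(df))   # min, quartiles, median, mean and max per column
str(df)              # type and first values of each column
```

On the full dataset the same four calls would report 1599 rows and 13 columns.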
Our dataset consists of 13 variables (one of which, X, is just a row index) and 1,599 observations. The variable I would like to focus on most is quality, so let's start by plotting its distribution with ggplot2.
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
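As a sketch of how this table and the matching bar chart could be produced, the snippet below rebuilds the quality column from the counts printed above, so it runs without the data file. The plotting part assumes ggplot2 is installed and is skipped otherwise. The variable name share_5_6 is mine, not part of the original analysis.

```r
# Rebuild the quality column from the counts printed above
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

counts <- table(quality)   # number of wines per quality score
print(counts)

# Share of wines rated 5 or 6 (hypothetical helper variable)
share_5_6 <- sum(counts[c("5", "6")]) / length(quality)
print(round(share_5_6, 3))

# Bar chart of the same counts; skipped when ggplot2 is not available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data.frame(quality = quality), aes(x = factor(quality))) +
    geom_bar() +
    labs(x = "quality", y = "number of wines")
  print(p)
}
```

The counts show that most wines sit in the middle of the scale: 1,319 of the 1,599 wines are rated 5 or 6.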
Next, I would like to understand the other variables, so I draw histograms with several bin sizes to see their distributions.
The left graph does not show much detail about the variable. After changing the bin size to 1, I can see that the distribution is right-skewed and that most wines have a fixed acidity between 7 and 10. After that, I transform it using log10.
With the log10 transformation, fixed acidity looks approximately normally distributed. Then, I continue with volatile acidity.
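The effect of bin size and of the log10 scale can be sketched as follows. The data here is synthetic and right-skewed, used only as a stand-in for fixed acidity. The skewness helper is a simple moment-based estimate I added for illustration. The plots assume ggplot2 is installed.

```r
set.seed(1)
# Synthetic right-skewed stand-in for fixed acidity (NOT the wine data)
fixed.acidity <- rlnorm(1599, meanlog = log(8), sdlog = 0.2)

# Simple moment-based skewness estimate (hypothetical helper)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(fixed.acidity)          # positive for right-skewed data
skew_log <- skewness(log10(fixed.acidity))   # closer to 0 after the transform
print(c(raw = skew_raw, log10 = skew_log))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  d <- data.frame(fixed.acidity = fixed.acidity)
  # Wide bins hide the shape; narrower bins reveal the right skew
  p_wide <- ggplot(d, aes(x = fixed.acidity)) + geom_histogram(binwidth = 5)
  p_fine <- ggplot(d, aes(x = fixed.acidity)) + geom_histogram(binwidth = 1)
  # Same variable on a log10 x-axis
  p_log <- ggplot(d, aes(x = fixed.acidity)) +
    geom_histogram(bins = 30) +
    scale_x_log10()
  print(p_wide)
  print(p_fine)
  print(p_log)
}
```

Comparing the two skewness values gives a numeric check to go alongside the visual impression from the histograms.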
For volatile acidity, there are some outliers with values above 1.1. I plot it again after trimming these values, keeping only observations below the 95th percentile.
The graph in the center now looks approximately normal. Then, I continue with citric acid.
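Trimming at the 95th percentile can be sketched with quantile. The vector below is made-up data with a few large values added by hand, not the wine data. On the real data frame the same idea would be a subset on df$volatile.acidity.

```r
set.seed(1)
# Made-up values: 95 typical observations plus 5 large ones (NOT the wine data)
volatile.acidity <- c(runif(95, 0.2, 0.9), 1.1, 1.2, 1.3, 1.4, 1.58)

cutoff <- quantile(volatile.acidity, 0.95)             # 95th percentile
trimmed <- volatile.acidity[volatile.acidity < cutoff] # keep values below it

print(cutoff)
print(length(trimmed))
```

One design note: trimming by percentile always removes about 5% of the rows, whether or not they are true outliers, so the cutoff value is worth inspecting before the trimmed plot is interpreted.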
Citric acid seems to be right-skewed, so I apply a square root transformation as shown below.
Applying a square root scale to the x-axis makes the citric acid distribution look closer to normal. However, it is quite obvious that citric acid has many zero values. I would like to know how many, so I count them.
## [1] 132
There are 132 rows with citric acid = 0, which is quite unusual. I trimmed out these zero values and plotted the graph again.
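Counting and dropping the zero values can be sketched as below. The vector holds only the first ten citric acid values shown in the str output, so the count here is 5 rather than the 132 found on the full data.

```r
# First ten citric acid values from the str() output above
citric.acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

n_zero <- sum(citric.acid == 0)            # TRUE counts as 1, so this counts zeros
nonzero <- citric.acid[citric.acid > 0]    # values kept for the second plot

print(n_zero)
print(nonzero)
```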
With bin size = 1, the histogram shows that around 1,100 wines have residual sugar in the range 1 to 2, and very few wines have residual sugar in the range 8 to 16. I also applied a log scale to residual sugar, and the graph looks more bell-shaped.
I investigated every variable using this method and saw that some distributions are not normal. I plotted those variables again with scale_x_log10 to see the result.
It appears that fixed acidity, volatile acidity, chlorides, total sulfur dioxide and sulphates become approximately normal after the transformation.
Next, in order to find unusual distributions, I created box plots of all variables.
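A box plot flags a point as an outlier when it lies more than 1.5 times the interquartile range beyond the box. As a small illustration, boxplot.stats returns exactly those points. The vector is the first ten residual sugar values from the str output, not the full column.

```r
# First ten residual sugar values from the str() output above
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

stats <- boxplot.stats(residual.sugar)
out <- stats$out          # points a box plot would draw as outliers
print(stats$stats)        # lower whisker, lower hinge, median, upper hinge, upper whisker
print(out)

# Base R box plot of the same values
boxplot(residual.sugar, main = "residual.sugar", horizontal = TRUE)
```

In this small sample only the value 6.1 is flagged.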
Reading the description of the variables, I saw that some of them could be grouped because they have similar characteristics, such as fixed.acidity and volatile.acidity. So I created a new variable, “Total Acid”, by summing fixed acidity, volatile acidity and citric acid, to see whether it would reveal anything interesting.
df$total.acid <- df$fixed.acidity + df$volatile.acidity + df$citric.acid
qplot(df$total.acid)
First, I want to look at the correlation between every pair of variables in more detail. I use corrplot to do this (with method = number for the exact values and method = square for better visualization).
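A minimal sketch of this step: cor builds the correlation matrix and corrplot draws it. The data frame uses only three variables and the first ten rows from the str output, so the values are illustrative and will differ from those of the full dataset. The corrplot calls assume the corrplot package is installed and are skipped otherwise.

```r
# First ten rows of three variables, taken from the str() output above
df <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

m <- cor(df)              # Pearson correlation matrix
print(round(m, 2))

if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(m, method = "number")   # exact values
  corrplot::corrplot(m, method = "square")   # easier to scan visually
}
```

Even in this tiny sample, fixed acidity and pH are negatively correlated, which is what chemistry would suggest: more acid means lower pH.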
After looking at the overview of the correlation table, I created the graphs below, using jitter plots and box plots, to get an overview of the relationship between each variable and quality.
From the box plots and jitter plots, I saw that some variables appear to determine the quality of the wine, such as
Then, I explored the correlation of the above variables with wine quality using ggpairs, which resulted in the following values.
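The correlation of each candidate variable with quality can also be computed directly with cor and ranked by absolute value. The sketch below uses three variables that I picked for illustration, with the first ten rows from the str output. This choice is an assumption, because the report's own list of variables is not shown. With only ten rows the numbers are not representative of the full data.

```r
# First ten rows from the str() output above; variable choice is illustrative
df <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

predictors <- setdiff(names(df), "quality")

# Pearson correlation of each predictor with quality
r <- sapply(df[predictors], function(x) cor(x, df$quality))

# Strongest relationships first, regardless of sign
ranked <- r[order(abs(r), decreasing = TRUE)]
print(round(ranked, 2))
```

Ranking by absolute value keeps strong negative relationships near the top, next to strong positive ones.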
It seems that density is not strongly related to quality.
Using the same correlation plot, I tried to identify the pairs of features with a correlation greater than 0.5, which are:
I then looked only at the features that have a strong correlation with quality, which are
Then I analysed other variables that are not related to the main feature, to look for interesting relationships among them. First, I grouped the data by total acid and assigned each wine to one of three classes (low, medium, high).
df$acid.class <- ifelse(df$total.acid < 8, 'low', ifelse(
df$total.acid < 12, 'medium', 'high'))
df$acid.class <- ordered(df$acid.class,
levels = c('low', 'medium', 'high'))
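As an alternative sketch, the same three classes can be built in one step with cut. The snippet checks on a few hand-picked values, including the boundaries 8 and 12, that it agrees with the nested ifelse version above. The test values are mine.

```r
# Hand-picked total acid values, including both class boundaries
total.acid <- c(7.9, 8, 11.99, 12, 15)

# Nested ifelse version, as in the code above
nested <- ordered(
  ifelse(total.acid < 8, "low", ifelse(total.acid < 12, "medium", "high")),
  levels = c("low", "medium", "high")
)

# cut() version: right = FALSE gives the intervals [-Inf, 8), [8, 12), [12, Inf)
with_cut <- cut(
  total.acid,
  breaks = c(-Inf, 8, 12, Inf),
  labels = c("low", "medium", "high"),
  right = FALSE,
  ordered_result = TRUE
)

print(data.frame(total.acid, nested, with_cut))
```

Setting right = FALSE matters here: with the default, a value of exactly 8 would fall into the low class instead of medium.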
Then I plotted acid class against pH, density, fixed acidity and citric acid, and got the graph below.
This box plot shows very clearly that alcohol has an impact on the quality of the wine: higher alcohol tends to go with better quality. However, the outliers show that alcohol alone may not produce a good wine. I also noticed that wines of quality 3 and 4 have more alcohol than wines of quality 5 despite their lower quality, which makes it a little harder to predict quality using alcohol alone.
This graph shows that pH, density, fixed acidity and citric acid are all related to the amount of acid in the wine.
- For pH, the higher the pH, the lower the acid class.
- For fixed acidity, the lower the fixed acidity, the lower the acid class.
- For density, the lower the density, the lower the acid class.
- For citric acid, the lower the citric acid, the lower the acid class.
This graph uses the two variables with the highest correlation with quality and plots them together with quality, after trimming outliers (limiting alcohol to 9-14 and volatile acidity to 0.15-1.2). It shows that lower volatile acidity and higher alcohol tend to go with better wine quality.
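A plot of this kind could be sketched as follows. The data is random and only spans the ranges reported in the summary above, so the picture will not show the real relationship. The geom and color mapping are my assumptions about how the original graph was drawn. The plot requires ggplot2 and is skipped if it is not installed.

```r
set.seed(42)
n <- 200
# Random stand-in spanning the ranges from summary() above (NOT the wine data)
df <- data.frame(
  alcohol = runif(n, 8.4, 14.9),
  volatile.acidity = runif(n, 0.12, 1.58),
  quality = sample(3:8, n, replace = TRUE)
)

# Apply the limits mentioned in the text before plotting
in_range <- subset(
  df,
  alcohol >= 9 & alcohol <= 14 &
    volatile.acidity >= 0.15 & volatile.acidity <= 1.2
)
print(nrow(in_range))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(in_range,
              aes(x = alcohol, y = volatile.acidity, color = factor(quality))) +
    geom_point(alpha = 0.6) +
    labs(color = "quality")
  print(p)
}
```

Filtering the rows first, rather than only zooming the axes, makes it explicit which wines are excluded from the picture.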
This project helped me become familiar with data analysis using scatter plots, histograms, box plots and so on, and it was very interesting to discover facts hidden in the data.
The hardest part of this project was working out how to extract the important information from the data: where to start, and how to make the right decision at each step. More specifically, I decided to focus on quality, which led to finding the important correlations between quality and the other variables.
I also think that the variables focus too narrowly on chemical components. The analysis might be improved by adding variables that are easier to interpret, such as country, year, color and production process, which could reveal more interesting results.